This report is submitted by Greeshma Jeev Koothuparambil and Olayemi Morrison as a part of Laboratory 2 of Visualization (732A98) Course for the 2023 Autumn Semester.

Assignment 1

Following are the libraries used for the successful completion of this assignment:
ggplot2
gridExtra
dplyr
ggpubr
plotly
grid

Here is how we loaded our libraries:

library(ggplot2)
library(gridExtra)
library(dplyr)
library(ggpubr)
library(plotly)
library(grid)

We read the file olive.csv into a dataframe:

#reading the file.

df <-  read.csv("olive.csv")

The loaded dataframe looks like this:

X  Region  Area          palmitic  palmitoleic  stearic  oleic  linoleic  linolenic  arachidic  eicosenoic
1  1       North-Apulia  1075      75           226      7823   672       36         60         29
2  1       North-Apulia  1088      73           224      7709   781       31         61         29
3  1       North-Apulia  911       54           246      8113   549       31         63         29
4  1       North-Apulia  966       57           240      7952   619       50         78         35
5  1       North-Apulia  1051      67           259      7771   672       50         80         46
6  1       North-Apulia  911       49           268      7924   678       51         70         44

It has 572 observations and 11 variables namely:
X, Region, Area, palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic.

The Summary of the table is as follows:

##        X             Region                   Area        palmitic   
##  Min.   :  1.0   Min.   :1.000   South-Apulia   :206   Min.   : 610  
##  1st Qu.:143.8   1st Qu.:1.000   Inland-Sardinia: 65   1st Qu.:1095  
##  Median :286.5   Median :1.000   Calabria       : 56   Median :1201  
##  Mean   :286.5   Mean   :1.699   Umbria         : 51   Mean   :1232  
##  3rd Qu.:429.2   3rd Qu.:3.000   East-Liguria   : 50   3rd Qu.:1360  
##  Max.   :572.0   Max.   :3.000   West-Liguria   : 50   Max.   :1753  
##                                  (Other)        : 94                 
##   palmitoleic        stearic          oleic         linoleic     
##  Min.   : 15.00   Min.   :152.0   Min.   :6300   Min.   : 448.0  
##  1st Qu.: 87.75   1st Qu.:205.0   1st Qu.:7000   1st Qu.: 770.8  
##  Median :110.00   Median :223.0   Median :7302   Median :1030.0  
##  Mean   :126.09   Mean   :228.9   Mean   :7312   Mean   : 980.5  
##  3rd Qu.:169.25   3rd Qu.:249.0   3rd Qu.:7680   3rd Qu.:1180.8  
##  Max.   :280.00   Max.   :375.0   Max.   :8410   Max.   :1470.0  
##                                                                  
##    linolenic       arachidic       eicosenoic   
##  Min.   : 0.00   Min.   :  0.0   Min.   : 1.00  
##  1st Qu.:26.00   1st Qu.: 50.0   1st Qu.: 2.00  
##  Median :33.00   Median : 61.0   Median :17.00  
##  Mean   :31.89   Mean   : 58.1   Mean   :16.28  
##  3rd Qu.:40.25   3rd Qu.: 70.0   3rd Qu.:28.00  
##  Max.   :74.00   Max.   :105.0   Max.   :58.00  
## 

1. Create a scatterplot in Ggplot2 that shows dependence of Palmitic on Oleic in which observations are colored by Linoleic. Create also a similar scatter plot in which you divide Linoleic variable into four classes (use cut_interval() ) and map the discretized variable to color instead. How easy/difficult is it to analyze each of these plots? What kind of perception problem is demonstrated by this experiment?

#plotting the first graph
p<-ggplot(df,aes(x=palmitic, y=oleic, color=linoleic)) + 
  geom_point()+
  ggtitle("Dependency of Palmitic over Oleic based on Linoleic")

intervaldf <- data.frame(cut_interval(df$linoleic,n = 4))
colnames(intervaldf) <- "linoleicinterval4"

p1<-ggplot(df,aes(x=palmitic, y=oleic, color=intervaldf$linoleicinterval4)) + 
  geom_point()+ 
  ggtitle("Dependency of Palmitic over Oleic based on Linoleic Level")+
  labs(color = "Linoleic Interval")

The plot looks like this:

Analysis

In the first plot “p”, it is tempting on first inspection to assume that a darker shade means a higher value, but the legend shows that the value increases as the color gets lighter. Because Linoleic is mapped to a continuous color scale, many shades are too similar to tell apart, and overlapping points hide some observations. The perception problem demonstrated here is that we cannot reliably discriminate small differences in shade, so it is hard to recognize, organize or interpret the Linoleic values.
The second plot “p1” is much easier to analyze because the discretized Linoleic variable is mapped to four distinct colors, and a clear boundary can be seen for each category.
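
To make the discretization concrete, here is a minimal base-R sketch of equal-width binning, which is what we understand cut_interval() to do for n = 4. The range 448 to 1470 is taken from the summary above and the four sample values are the first four Linoleic readings in the table. The variable names are ours and purely illustrative.

```r
# Equal-width binning of Linoleic into four classes (base-R sketch,
# assumed to mirror ggplot2::cut_interval(x, n = 4)).
linoleic_range <- c(448, 1470)            # min and max from the summary table
breaks <- seq(linoleic_range[1], linoleic_range[2], length.out = 5)

sample_values <- c(672, 781, 549, 619)    # first four Linoleic values in the data
classes <- cut(sample_values, breaks = breaks, include.lowest = TRUE)

print(breaks)
print(data.frame(linoleic = sample_values, class = classes))
```

Each class covers the same width on the Linoleic scale, so the number of observations per class can differ considerably.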


2. Create scatterplots of Palmitic vs Oleic in which you map the discretized Linoleic with four classes to:
a. Color
b. Size
c. Orientation angle (use geom_spoke() )
State in which plots it is more difficult to differentiate between the categories and connect your findings to perception metrics (i.e. how many bits can be decoded by a specific aesthetics)

#plotting the second graph
t1 <- textGrob("Comparison of different graphs on \n Dependency of Palmitic over Oleic based on Linoleic Level")
p1<-ggplot(df,aes(x=palmitic, y=oleic, color=intervaldf$linoleicinterval4)) + 
  geom_point()+ labs(color = "Linoleic Interval")
p2<-ggplot(df,aes(x=palmitic, y=oleic,  size=intervaldf$linoleicinterval4)) + 
  geom_point()+ labs(size = "Linoleic Interval")
p3<-ggplot(df,aes(x=palmitic, y=oleic)) +
  geom_point()+ geom_spoke(aes(angle = as.integer(intervaldf$linoleicinterval4)), radius = 50)
graphs <- arrangeGrob(grobs = list(t1,p1,p2,p3), ncol = 1,nrow = 4,heights = c(4,10,10,10))
palmiticXoleicXlinoleic <- as_ggplot(graphs)

The resulting graph looks interesting:

Analysis

In the plots “p2” and “p3”, where the discretized variable is mapped to size and to spoke orientation, it is really hard to differentiate between the categories. The circles and spokes overlap, and this overplotting hides data and makes it difficult to distinguish the boundaries between observations. It is practically impossible to recognize, organize or interpret the data in this way.
According to the perception metrics, up to about 10 hues can be recognized, and here we use only 4, which makes them easy to distinguish: log2(4) = 2 bits is well within that capacity. In the size-based scatter plot, even though up to 5 sizes should be recognizable, the overplotting makes it hard here, and even 2 bits of channel capacity seem hard to decode. The same holds for the spokes: even though human perception has a capacity of about 3 bits for line orientation, the overlapping values make it hard to read the orientation of every spoke.
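
The channel capacities quoted above can be converted to bits with a base-2 logarithm. The sketch below uses only the level counts mentioned in this analysis (10 hues, 5 sizes, 4 classes). The orientation entry of 8 levels is simply 2^3, which is our reading of the 3-bit figure and should be treated as an assumption.

```r
# Number of distinguishable levels per aesthetic, as quoted in the analysis.
# orientation = 8 is an assumption derived from the 3-bit figure (2^3).
levels_per_channel <- c(hue = 10, size = 5, orientation = 8)

# Channel capacity in bits is log2 of the number of distinguishable levels.
capacity_bits <- log2(levels_per_channel)

# Bits needed to encode the four Linoleic classes.
required_bits <- log2(4)

print(round(capacity_bits, 2))
print(required_bits <= capacity_bits)   # TRUE where four classes fit in theory
```

All three channels have enough nominal capacity for four classes, which supports our reading that the difficulty in p2 and p3 comes from overplotting rather than from the aesthetics alone.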


3. Create a scatterplot of Oleic vs Eicosenoic in which color is defined by numeric values of Region. What is wrong with such a plot? Now create a similar kind of plot in which Region is a categorical variable. How quickly can you identify decision boundaries? Does preattentive or attentive mechanism make it possible?

#plotting the third graph
t2 <- textGrob("Comparison of different graphs on \n Dependency of Oleic over Eicosenoic based on Region")
p4<-ggplot(df,aes(x=oleic, y= eicosenoic, color=Region)) + 
  geom_point()
p5<-ggplot(df,aes(x=oleic, y= eicosenoic, color=as.factor(Region))) + 
  geom_point()
graphs <- arrangeGrob(grobs = list(t2,p4,p5), ncol = 1,nrow = 3,heights = c(4,10,10))
oleicXeicosenoicXRegion <- as_ggplot(graphs)

The generated scatter plots look like this:

Analysis

Creating the scatterplot with Region as a numeric variable makes the data more difficult to interpret. Even though there are clear boundaries, the continuous color scale suggests intermediate values that do not exist, and we cannot easily tell which shade is mapped to which region. Creating the scatterplot with Region as a categorical variable makes it much quicker to identify the boundaries, because the distinct colors are picked up by the preattentive mechanism and the legend shows directly which color refers to which region.


4. Create a scatterplot of Oleic vs Eicosenoic in which color is defined by a discretized Linoleic (3 classes), shape is defined by a discretized Palmitic (3 classes) and size is defined by a discretized Palmitoleic (3 classes). How difficult is it to differentiate between 27=3* 3* 3 different types of observations? What kind of perception problem is demonstrated by this graph?

#plotting the fourth graph

intervaldf$linoleicinterval3 <- cut_interval(df$linoleic,n = 3)
intervaldf$palmiticinterval3 <- cut_interval(df$palmitic,n = 3)
intervaldf$palmitoleicinterval3 <- cut_interval(df$palmitoleic,n = 3)

p6<-ggplot(df,aes(x=oleic, y= eicosenoic, color=intervaldf$linoleicinterval3,
                  shape = intervaldf$palmiticinterval3, size = intervaldf$palmitoleicinterval3)) + 
  geom_point()+ 
  ggtitle("Dependency of Oleic over Eicosenoic based on \nLinoleic Level, Palmitic and Palmitoleic")+
  labs(color = "Linoleic Interval", shape = "Palmitic Interval", size = "Palmitoleic Interval")

The dependency of Oleic over Eicosenoic for the different Linoleic, Palmitic and Palmitoleic levels is shown in the following plot:

Analysis

It is very difficult to differentiate between the 3 * 3 * 3 classes of observations, because processing and remembering the distinctions between so many categories is mentally taxing for the viewer. Using too many colors, shapes and sizes at once also risks misinterpretation, and some visual cues are less effective than others. There is a relative judgement error to be noted in the graph: the plotted values have similar positions along the common scale, which makes them difficult to recognize. Added to that is the problem of judging area. Attaching too many features to a single point is tiring for the brain, because attentive perception relies only on short-term memory. Even though the Linoleic interval uses different hues for effective recognition, the hues say nothing about the order of the values. Seeing different colors makes the brain interpret the classes as independent of each other and overlook the fact that each color represents a range of Linoleic content in the oil. Here a perspective problem arises.


5. Create a scatterplot of Oleic vs Eicosenoic in which color is defined by Region, shape is defined by a discretized Palmitic (3 classes) and size is defined by a discretized Palmitoleic (3 classes). Why is it possible to clearly see a decision boundary between Regions despite many aesthetics are used? Explain this phenomenon from the perspective of Treisman’s theory.

#plotting the fifth graph

p7<-ggplot(df,aes(x=oleic, y= eicosenoic, color=Region,
                  shape = intervaldf$palmiticinterval3, size = intervaldf$palmitoleicinterval3)) + 
  geom_point()+ 
  ggtitle("Dependency of Oleic over Eicosenoic based on \nRegion, Palmitic and Palmitoleic")+
  labs(color = "Region", shape = "Palmitic Interval", size = "Palmitoleic Interval")

The dependency of Oleic over Eicosenoic for the different Regions and the Palmitic and Palmitoleic levels is shown in the following plot:

Analysis

It is possible to see the decision boundary clearly because, according to Treisman’s theory, color is a basic feature that is processed quickly and preattentively, which is very obvious in this plot. Once the boundaries have been identified, we can then use focused attention to identify the various shapes and sizes and gather more information.


6. Use Plotly to create a pie chart that shows the proportions of oils coming from different Areas. Hide labels in this plot and keep only hover-on labels. Which problem is demonstrated by this graph?

#plotting the sixth graph

p8 <- plot_ly(data=df,labels=~factor(Area), type = "pie", hoverinfo ='label', textinfo = "none")%>%
  layout(title= "Areal Distribution of Oil Production", showlegend = F)

The resulting graph is as follows:

Analysis

Human perception cannot be relied on when observing this graph, because we cannot accurately determine the angle or area of each slice just by looking at it. Relative judgement based on angle is compromised here, which gives rise to scaling and perspective problems.
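
To illustrate how small the angular differences can be, the sketch below converts the Area counts printed in the summary table into slice angles. It is only an approximation of the actual pie chart: the summary groups the remaining areas under "(Other)", whereas the chart draws each of them as its own slice.

```r
# Area counts as printed in the summary table ("(Other)" groups the rest).
area_counts <- c("South-Apulia" = 206, "Inland-Sardinia" = 65, "Calabria" = 56,
                 "Umbria" = 51, "East-Liguria" = 50, "West-Liguria" = 50,
                 "(Other)" = 94)

# Each slice's central angle is its share of the 360-degree circle.
slice_angle <- 360 * area_counts / sum(area_counts)

print(round(slice_angle, 1))
# Difference in angle between two slices of nearly equal size.
print(slice_angle[["Umbria"]] - slice_angle[["East-Liguria"]])
```

Umbria and East-Liguria differ by less than one degree, which is not something a viewer can judge by eye without the hover labels.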


7. Create a 2d-density contour plot with Ggplot2 in which you show dependence of Linoleic vs Eicosenoic. Compare the graph to the scatterplot using the same variables and comment why this contour plot can be misleading.

#plotting the seventh graph
t3 <- textGrob("Comparison of Density contour and Relational Scatter plot between Linoleic and Eicosenoic")
p9<-ggplot(df,aes(x=linoleic, y= eicosenoic)) +
  geom_density_2d()
p10<-ggplot(df,aes(x=linoleic, y= eicosenoic)) + 
  geom_point()

graphs1 <- arrangeGrob(grobs = list(t3,p9,p10), ncol = 1,nrow = 3, heights = c(1,10,10))
linoleicXeicosenoic <- as_ggplot(graphs1)

The resulting graph is as follows:

Analysis

Comparing the two graphs, we see that the density plot obscures certain patterns. In some areas where the scatterplot shows many points, the density plot displays a low density, or no contour at all. The appearance of the density plot is also highly sensitive to the arrangement and distribution of the data points, so apparent clusters and patterns may differ from those in the scatterplot, potentially leading to misinterpretation.
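
The sensitivity mentioned above can be sketched with MASS::kde2d(), the estimator that geom_density_2d() relies on. The example uses synthetic data rather than the olive data, and the two bandwidths are arbitrary values we chose for illustration. It shows only that the same points can yield very different density surfaces.

```r
library(MASS)  # provides kde2d()

set.seed(1)
# Synthetic data (not the olive data): two well-separated clusters.
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 10))
y <- c(rnorm(100, mean = 0), rnorm(100, mean = 10))

# Same points, two bandwidths: a narrow one and a very wide one.
dens_narrow <- kde2d(x, y, h = 2,  n = 50)
dens_wide   <- kde2d(x, y, h = 40, n = 50)

# The wide bandwidth flattens the estimate and blurs the two clusters together.
print(max(dens_narrow$z))
print(max(dens_wide$z))
```

Contours drawn from the wide estimate would suggest one broad blob, while the scatterplot of the same points shows two distinct groups.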


Assignment 2

Following are the libraries used for the successful completion of this assignment:
plotly
MASS
xlsx
tidyr

Here is how we loaded our libraries:

library(plotly)
library(MASS)
library(xlsx)
library(tidyr)

1. Load the file to R and answer whether it is reasonable to scale these data in order to perform a multidimensional scaling (MDS).

#Reading the baseball data
baseball = read.xlsx(file = "baseball-2016.xlsx",sheetIndex = 1)
row.names(baseball) <- baseball$Team

The loaded dataframe looks like this:

Team League Won Lost Runs.per.game HR.per.game
Aizona Diamondbacks Aizona Diamondbacks NL 69 93 4.64 1.172840
Atlanta Braves Atlanta Braves NL 68 93 4.03 0.757764
Baltimore Orioles Baltimore Orioles AL 89 73 4.59 1.561728
Boston Red Sox Boston Red Sox AL 93 69 5.42 1.283951
Chicago Cubs Chicago Cubs NL 103 58 4.99 1.236025
Chicago White Sox Chicago White Sox AL 78 84 4.23 1.037037
AB Runs Hits X2B X3B HR
Aizona Diamondbacks 5665 752 1479 285 56 190
Atlanta Braves 5514 649 1404 295 27 122
Baltimore Orioles 5524 744 1413 265 6 253
Boston Red Sox 5670 878 1598 343 25 208
Chicago Cubs 5503 808 1409 293 30 199
Chicago White Sox 5550 686 1428 277 33 168
RBI StolenB CaughtS BB SO BAvg
Aizona Diamondbacks 709 137 31 463 1427 0.261
Atlanta Braves 615 75 34 502 1240 0.255
Baltimore Orioles 710 19 13 468 1324 0.256
Boston Red Sox 836 83 24 558 1160 0.282
Chicago Cubs 767 66 34 656 1339 0.256
Chicago White Sox 656 77 36 455 1285 0.257
OBP SLG OPS TB GDP HBP
Aizona Diamondbacks 0.320 0.432 0.752 2446 117 50
Atlanta Braves 0.321 0.384 0.705 2119 145 59
Baltimore Orioles 0.317 0.443 0.760 2449 119 44
Boston Red Sox 0.348 0.461 0.810 2615 137 43
Chicago Cubs 0.343 0.429 0.772 2359 107 96
Chicago White Sox 0.317 0.410 0.727 2275 122 53
SH SF IBB LOB
Aizona Diamondbacks 43 38 43 1113
Atlanta Braves 64 52 60 1161
Baltimore Orioles 17 36 19 1065
Boston Red Sox 8 40 34 1162
Chicago Cubs 42 37 45 1217
Chicago White Sox 29 44 16 1105

It has 30 observations and 28 variables namely:
Team, League, Won, Lost, Runs.per.game, HR.per.game, AB, Runs, Hits, X2B, X3B, HR, RBI, StolenB, CaughtS, BB, SO, BAvg, OBP, SLG, OPS, TB, GDP, HBP, SH, SF, IBB and LOB.

The Summary of the table is as follows:

##                   Team    League       Won             Lost      
##  Aizona Diamondbacks: 1   AL:15   Min.   : 59.0   Min.   : 58.0  
##  Atlanta Braves     : 1   NL:15   1st Qu.: 71.5   1st Qu.: 73.5  
##  Baltimore Orioles  : 1           Median : 82.5   Median : 79.5  
##  Boston Red Sox     : 1           Mean   : 80.9   Mean   : 80.9  
##  Chicago Cubs       : 1           3rd Qu.: 88.5   3rd Qu.: 90.5  
##  Chicago White Sox  : 1           Max.   :103.0   Max.   :103.0  
##  (Other)            :24                                          
##  Runs.per.game    HR.per.game           AB            Runs            Hits     
##  Min.   :3.770   Min.   :0.7578   Min.   :5330   Min.   :610.0   Min.   :1275  
##  1st Qu.:4.178   1st Qu.:1.0185   1st Qu.:5482   1st Qu.:676.2   1st Qu.:1369  
##  Median :4.465   Median :1.1852   Median :5521   Median :723.0   Median :1410  
##  Mean   :4.478   Mean   :1.1556   Mean   :5519   Mean   :724.8   Mean   :1409  
##  3rd Qu.:4.705   3rd Qu.:1.3039   3rd Qu.:5550   3rd Qu.:762.0   3rd Qu.:1444  
##  Max.   :5.420   Max.   :1.5617   Max.   :5670   Max.   :878.0   Max.   :1598  
##                                                                                
##       X2B             X3B             HR             RBI       
##  Min.   :231.0   Min.   : 6.0   Min.   :122.0   Min.   :575.0  
##  1st Qu.:257.5   1st Qu.:21.0   1st Qu.:165.0   1st Qu.:647.5  
##  Median :276.5   Median :29.0   Median :192.0   Median :687.5  
##  Mean   :275.2   Mean   :29.1   Mean   :187.0   Mean   :691.5  
##  3rd Qu.:288.0   3rd Qu.:33.0   3rd Qu.:210.2   3rd Qu.:731.8  
##  Max.   :343.0   Max.   :56.0   Max.   :253.0   Max.   :836.0  
##                                                                
##     StolenB          CaughtS            BB              SO      
##  Min.   : 19.00   Min.   :13.00   Min.   :382.0   Min.   : 991  
##  1st Qu.: 58.50   1st Qu.:26.50   1st Qu.:452.8   1st Qu.:1228  
##  Median : 76.00   Median :34.00   Median :498.0   Median :1302  
##  Mean   : 84.57   Mean   :33.37   Mean   :502.9   Mean   :1299  
##  3rd Qu.:108.00   3rd Qu.:38.50   3rd Qu.:534.8   3rd Qu.:1356  
##  Max.   :181.00   Max.   :56.00   Max.   :656.0   Max.   :1543  
##                                                                 
##       BAvg             OBP              SLG              OPS        
##  Min.   :0.2350   Min.   :0.2990   Min.   :0.3840   Min.   :0.6850  
##  1st Qu.:0.2482   1st Qu.:0.3160   1st Qu.:0.4027   1st Qu.:0.7245  
##  Median :0.2560   Median :0.3215   Median :0.4170   Median :0.7335  
##  Mean   :0.2553   Mean   :0.3214   Mean   :0.4174   Mean   :0.7387  
##  3rd Qu.:0.2607   3rd Qu.:0.3282   3rd Qu.:0.4300   3rd Qu.:0.7558  
##  Max.   :0.2820   Max.   :0.3480   Max.   :0.4610   Max.   :0.8100  
##                                                                     
##        TB            GDP             HBP              SH              SF       
##  Min.   :2090   Min.   : 88.0   Min.   :33.00   Min.   : 8.00   Min.   :28.00  
##  1st Qu.:2212   1st Qu.:114.8   1st Qu.:44.25   1st Qu.:24.50   1st Qu.:36.00  
##  Median :2292   Median :122.5   Median :53.00   Median :35.50   Median :39.50  
##  Mean   :2304   Mean   :124.0   Mean   :55.03   Mean   :34.17   Mean   :40.47  
##  3rd Qu.:2387   3rd Qu.:136.5   3rd Qu.:61.25   3rd Qu.:42.75   3rd Qu.:43.75  
##  Max.   :2615   Max.   :153.0   Max.   :96.00   Max.   :64.00   Max.   :63.00  
##                                                                                
##       IBB             LOB      
##  Min.   :16.00   Min.   : 965  
##  1st Qu.:21.50   1st Qu.:1061  
##  Median :31.00   Median :1102  
##  Mean   :31.07   Mean   :1097  
##  3rd Qu.:38.50   3rd Qu.:1120  
##  Max.   :60.00   Max.   :1217  
## 
Analysis

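Scaling is reasonable and necessary because the variables are measured on very different scales, with values ranging from 0.235 (BAvg) up to 5670 (AB). MDS works on distances between observations, so without scaling the variables with the largest magnitudes would dominate the distances and the small-valued variables would hardly matter.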
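
A small sketch of this effect, using the AB and BAvg values of the first three teams from the table above. Restricting the example to two columns and three teams is our simplification.

```r
# AB and BAvg for the first three teams, copied from the table above
# (team names are spelled as in the data).
teams <- rbind(
  "Aizona Diamondbacks" = c(AB = 5665, BAvg = 0.261),
  "Atlanta Braves"      = c(AB = 5514, BAvg = 0.255),
  "Baltimore Orioles"   = c(AB = 5524, BAvg = 0.256)
)

# Raw Euclidean distances are essentially the absolute differences in AB.
d_raw <- dist(teams)
# After standardizing each column, both variables contribute.
d_scaled <- dist(scale(teams))

print(round(d_raw, 3))
print(round(d_scaled, 3))
```

In the raw distances BAvg is invisible, because its differences are of the order of 0.001 while the AB differences are in the tens or hundreds.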


2. Write an R code that performs a non-metric MDS with Minkowski distance=2 of the data (numerical columns) into two dimensions. Visualize the resulting observations in Plotly as a scatter plot in which observations are colored by League. Does it seem to exist a difference between the leagues according to the plot? Which of the MDS components seem to provide the best differentiation between the Leagues? Which baseball teams seem to be outliers?

#Plotting the first graph
# standardize columns 3 to 27 (Won to IBB); column 28 (LOB) is not included
baseball.numeric= scale(baseball[,3:27])
d=dist(baseball.numeric)

res=isoMDS(d,k = 2, p=2)
## initial  value 19.778879 
## iter   5 value 16.074932
## iter  10 value 15.763031
## final  value 15.692462 
## converged
coords=res$points

coordsMDS=as.data.frame(coords)
coordsMDS$Team=rownames(coordsMDS)
coordsMDS$League=baseball$League

b1 <- plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter",mode= "markers" ,color= ~League, hovertext = ~Team, colors = "Set1")

The Scatter plot looks as follows:

Analysis

The teams appear to be almost equally distributed between the leagues. Even so, the V2 variable differentiates the leagues to some extent through a faint boundary at a V2 value of around -1.5. Based on this boundary, some NL teams that are located beyond it can be categorised as outliers: the St. Louis Cardinals, NY Mets, Los Angeles Dodgers, San Diego Padres, Philadelphia Phillies, Milwaukee Brewers and Chicago Cubs.


3. Use Plotly to create a Shepard plot for the MDS performed and comment about how successful the MDS was. Which observation pairs were hard for the MDS to map successfully?

#Plotting the second graph
sh <- Shepard(d, coords)
delta <-as.numeric(d)
D<- as.numeric(dist(coords))

n=nrow(coords)
# index1 holds the row number of each lower-triangle pair
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])

# index2 holds the column number of the same pairs
index=matrix(1:n, nrow=n, ncol=n, byrow = TRUE)
index2=as.numeric(index[lower.tri(index)])



b2 <- plot_ly()%>%
  add_markers(x=~delta, y=~D, hoverinfo = 'text',
              text = ~paste('Obj1: ', rownames(baseball)[index1],
                            '<br> Obj 2: ', rownames(baseball)[index2]))%>%
  add_lines(x=~sh$x, y=~sh$yf)

The Shepard plot is shown below:

Analysis

The MDS does not seem very successful, because no strong linear relationship between the original dissimilarities and the distances in the MDS configuration was observed. This can be seen in the Shepard plot, where the fitted line does not closely follow a straight diagonal, even though the data points form a fairly close cluster around it. The observation pair that was hardest to map successfully was the Minnesota Twins and the Arizona Diamondbacks (spelled “Aizona Diamondbacks” in the data). This pair appears as an outlier despite not having the highest dissimilarity.
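
The hover labels that identify such pairs depend on index1 and index2 lining up with the order in which dist() stores its values. The sketch below checks this on a toy example with four hypothetical objects (A to D). The toy coordinates are made up and have nothing to do with the baseball data.

```r
# Toy data with four labelled objects (hypothetical, not the baseball data).
toy <- matrix(c(0, 0,  3, 4,  6, 8,  0, 5), ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D"), NULL))
d_toy <- dist(toy)
n <- nrow(toy)

# Cell [i, j] holds the row number i.
row_index <- matrix(1:n, nrow = n, ncol = n)
# Cell [i, j] holds the column number j.
col_index <- matrix(1:n, nrow = n, ncol = n, byrow = TRUE)

# Lower-triangle cells are read column by column, the same order dist() uses.
index1 <- row_index[lower.tri(row_index)]
index2 <- col_index[lower.tri(col_index)]

print(data.frame(obj1 = rownames(toy)[index1],
                 obj2 = rownames(toy)[index2],
                 distance = as.numeric(d_toy)))
```

Each row of the printed table pairs two object names with the distance between exactly those two objects, which is what the hover text in the Shepard plot relies on.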


4. Produce series of scatterplots in which you plot the MDS variable that was the best in the differentiation between the leagues in step 2 against all other numerical variables of the data. Pick up two scatterplots that seem to show the strongest (positive or negative)

# Plotting the third graph
baseball$SuperVar <- coordsMDS$V2


b3 <- baseball %>%
  gather(-Team, -League, -SuperVar, key = "var", value = "value") %>% 
  ggplot(aes(x = value, y = SuperVar)) +
  geom_point(color ="Blue") +
  facet_wrap(~ var, ncol = 7, scales = "free")

The grid plot looks as below:

Analysis

The two strongest scatterplots are as follows:

SuperVar against SF: On this graph, most of the data points are clustered from left to right in a negative slope, in between values ranging from 30 to 50, with a few outliers located around the 60 mark.

SuperVar against X3B: Here the data points are more clustered towards the center of the graph, in between values of 18 and 40. Some extreme outliers are seen below 10 and around 50.

These two graphs were chosen because they demonstrate a close relationship between the two variables being plotted, making it easier to make predictions and draw meaningful conclusions about the data.
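
The selection above was made by eye. As a possible cross-check, one could rank the variables by their absolute correlation with the MDS variable. The sketch below shows the idea on synthetic data: the column names and values are hypothetical stand-ins, not the baseball statistics, and Pearson correlation is our choice of measure.

```r
set.seed(42)
# Synthetic stand-in for the baseball table (hypothetical columns).
n <- 30
super_var <- rnorm(n)
stats <- data.frame(
  strong_neg = -2 * super_var + rnorm(n, sd = 0.3),  # built to correlate strongly
  moderate   = super_var + rnorm(n, sd = 2),
  noise      = rnorm(n)
)

# Pearson correlation of every column with the MDS coordinate.
r <- sapply(stats, function(col) cor(col, super_var))

# Rank by absolute correlation, strongest first.
ranked <- r[order(abs(r), decreasing = TRUE)]
print(round(ranked, 2))
```

Applied to the real data, the first two entries of such a ranking would be natural candidates for the two strongest scatterplots, although a visual check remains useful because correlation only captures linear relationships.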


STATEMENT OF CONTRIBUTION

For the first assignment, the coding was done by Greeshma Jeev and the analysis was done by Olayemi. We both went through the outputs and the analysis and made our own suggestions on the results in order to make this report a success.

As for the second assignment, since we are both new to MDS and its application in R, we sat together and studied various aspects of MDS and its coding in R by going through lecture slides, textbooks and web resources. Several ambiguities arose while working on it. They were cleared up by discussing with classmates, and those that remained after these discussions were resolved by Mr Oleg. After getting a clearer understanding of the assignment, the coding was done by Greeshma Jeev. As templates were already available for most of the code in this assignment, we both had more time to discuss and defend our analysis and findings.

The RMD file was designed together and coded by Greeshma Jeev. Content writing was done by both Olayemi and Greeshma Jeev.


APPENDIX

Code for Assignment 1 (Olive Data)

library(ggplot2)
library(gridExtra)
library(dplyr)
library(ggpubr)
library(plotly)
library(grid)
# Read the file
df <-  read.csv("olive.csv")
sumdf <- summary(df)

#plotting the first graph
p<-ggplot(df,aes(x=palmitic, y=oleic, color=linoleic)) + 
  geom_point()+
  ggtitle("Dependency of Palmitic over Oleic based on Linoleic")

intervaldf <- data.frame(cut_interval(df$linoleic,n = 4))
colnames(intervaldf) <- "linoleicinterval4"

p1<-ggplot(df,aes(x=palmitic, y=oleic, color=intervaldf$linoleicinterval4)) + 
  geom_point()+ 
  ggtitle("Dependency of Palmitic over Oleic based on Linoleic Level")+
  labs(color = "Linoleic Interval")


#plotting the second graph
t1 <- textGrob("Comparison of different graphs on \n Dependency of Palmitic over Oleic based on Linoleic Level")
p1<-ggplot(df,aes(x=palmitic, y=oleic, color=intervaldf$linoleicinterval4)) + 
  geom_point()+ labs(color = "Linoleic Interval")
p2<-ggplot(df,aes(x=palmitic, y=oleic,  size=intervaldf$linoleicinterval4)) + 
  geom_point()+ labs(size = "Linoleic Interval")
p3<-ggplot(df,aes(x=palmitic, y=oleic)) +
  geom_point()+ geom_spoke(aes(angle = as.integer(intervaldf$linoleicinterval4)), radius = 50)
graphs <- arrangeGrob(grobs = list(t1,p1,p2,p3), ncol = 1,nrow = 4,heights = c(4,10,10,10))
palmiticXoleicXlinoleic <- as_ggplot(graphs)
palmiticXoleicXlinoleic 


#plotting the third graph
t2 <- textGrob("Comparison of different graphs on \n Dependency of Oleic over Eicosenoic based on Region")
p4<-ggplot(df,aes(x=oleic, y= eicosenoic, color=Region)) + 
  geom_point()
p5<-ggplot(df,aes(x=oleic, y= eicosenoic, color=as.factor(Region))) + 
  geom_point()
graphs <- arrangeGrob(grobs = list(t2,p4,p5), ncol = 1,nrow = 3,heights = c(4,10,10))
oleicXeicosenoicXRegion <- as_ggplot(graphs)
oleicXeicosenoicXRegion

#plotting the fourth graph

intervaldf$linoleicinterval3 <- cut_interval(df$linoleic,n = 3)
intervaldf$palmiticinterval3 <- cut_interval(df$palmitic,n = 3)
intervaldf$palmitoleicinterval3 <- cut_interval(df$palmitoleic,n = 3)

p6<-ggplot(df,aes(x=oleic, y= eicosenoic, color=intervaldf$linoleicinterval3,
                  shape = intervaldf$palmiticinterval3, size = intervaldf$palmitoleicinterval3)) + 
  geom_point()+ 
  ggtitle("Dependency of Oleic over Eicosenoic based on Linoleic Level, Palmitic and Palmitoleic")+
  labs(color = "Linoleic Interval", shape = "Palmitic Interval", size = "Palmitoleic Interval")


#plotting the fifth graph

p7<-ggplot(df,aes(x=oleic, y= eicosenoic, color=Region,
                  shape = intervaldf$palmiticinterval3, size = intervaldf$palmitoleicinterval3)) + 
  geom_point()+ 
  ggtitle("Dependency of Oleic over Eicosenoic based on Region, Palmitic and Palmitoleic")+
  labs(color = "Region", shape = "Palmitic Interval", size = "Palmitoleic Interval")


#plotting the sixth graph

p8 <- plot_ly(data=df,labels=~factor(Area), type = "pie", hoverinfo ='label', textinfo = "none")%>%
  layout(title= "Areal Distribution of Oil Production", showlegend = F)


#plotting the seventh graph
t3 <- textGrob("Comparison of Density contour and Relational Scatter plot between Linoleic and Eicosenoic")
p9<-ggplot(df,aes(x=linoleic, y= eicosenoic)) +
  geom_density_2d()
p10<-ggplot(df,aes(x=linoleic, y= eicosenoic)) + 
  geom_point()

graphs1 <- arrangeGrob(grobs = list(t3,p9,p10), ncol = 1,nrow = 3, heights = c(1,10,10))
linoleicXeicosenoic <- as_ggplot(graphs1)

Code for Assignment 2 (Baseball Data)

library(plotly)
library(MASS)
library(xlsx)
library(tidyr)

#Reading the baseball data
baseball = read.xlsx(file = "baseball-2016.xlsx",sheetIndex = 1)
row.names(baseball) <- baseball$Team
summary(baseball)

#Plotting the first graph
baseball.numeric= scale(baseball[,3:27])
d=dist(baseball.numeric)

res=isoMDS(d,k = 2, p=2)
coords=res$points

coordsMDS=as.data.frame(coords)
coordsMDS$Team=rownames(coordsMDS)
coordsMDS$League=baseball$League

b1 <- plot_ly(coordsMDS, x=~V1, y=~V2, type="scatter",mode= "markers" ,color= ~League, hovertext = ~Team, colors = "Set1")

#Plotting the second graph
sh <- Shepard(d, coords)
delta <-as.numeric(d)
D<- as.numeric(dist(coords))

n=nrow(coords)
# index1 holds the row number of each lower-triangle pair
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])

# index2 holds the column number of the same pairs
index=matrix(1:n, nrow=n, ncol=n, byrow = TRUE)
index2=as.numeric(index[lower.tri(index)])



b2 <- plot_ly()%>%
  add_markers(x=~delta, y=~D, hoverinfo = 'text',
              text = ~paste('Obj1: ', rownames(baseball)[index1],
                            '<br> Obj 2: ', rownames(baseball)[index2]))%>%
  add_lines(x=~sh$x, y=~sh$yf)


# Plotting the third graph
baseball$SuperVar <- coordsMDS$V2


b3 <- baseball %>%
  gather(-Team, -League, -SuperVar, key = "var", value = "value") %>% 
  ggplot(aes(x = value, y = SuperVar)) +
  geom_point(color ="Blue") +
  facet_wrap(~ var, ncol = 7, scales = "free")